To begin the exploratory analysis, the cleaned dataset is loaded by sourcing an external R script that reproduces all earlier data-wrangling steps, so the data arrive consistent and ready for further analysis.

source("extracted_Data_Cleaning_code.R", local = knitr::knit_global())
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
## New names:
## • `` -> `...1`
## Rows: 160000 Columns: 17
## ── Column specification ────────────────────────────────
## Delimiter: ","
## chr (9): asin, text, title.x, parent_asin, user_id, main_category, title.y, ...
## dbl (7): ...1, rating, helpful_vote, timestamp, average_rating, rating_numbe...
## lgl (1): verified_purchase
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## 
##  iter imp variable
##   1   1  price
##   1   2  price
##   1   3  price
##   1   4  price
##   1   5  price
##   2   1  price
##   2   2  price
##   2   3  price
##   2   4  price
##   2   5  price
##   3   1  price
##   3   2  price
##   3   3  price
##   3   4  price
##   3   5  price
##   4   1  price
##   4   2  price
##   4   3  price
##   4   4  price
##   4   5  price
##   5   1  price
##   5   2  price
##   5   3  price
##   5   4  price
##   5   5  price
## Warning: Number of logged events: 1
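The iteration log above comes from multiple imputation of missing `price` values with the mice package inside the cleaning script. A minimal, self-contained sketch of such a call (toy data and parameters are assumptions; the log suggests m = 5 imputations over 5 iterations):

```r
library(mice)  # multiple imputation package (inferred from the log above)

# Toy data with missing prices (hypothetical values, not from the dataset)
df <- data.frame(price  = c(10, NA, 25, NA, 8, 30),
                 rating = c(5, 4, 3, 5, 2, 4))

imp <- mice(df, m = 5, maxit = 5, seed = 1, printFlag = FALSE)
completed <- complete(imp)   # first completed dataset
sum(is.na(completed$price))  # no missing prices remain
```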

3. Descriptive Statistics

To understand the overall structure and central tendencies of the dataset, several descriptive statistics are computed at both global and category-specific levels. These summaries offer insights into key variables such as helpfulness, review length, sentiment score, and product price.

3.1. Overall Summary Statistics

The following output displays a basic summary of selected numeric variables and aggregated descriptive measures across the entire dataset.

summary(sample_data %>% 
          dplyr::select(helpful_vote, review_length, review_age, sentiment_score, price))
##   helpful_vote      review_length       review_age    sentiment_score  
##  Min.   :   0.000   Min.   :    7.0   Min.   :  775   Min.   :-71.000  
##  1st Qu.:   0.000   1st Qu.:   83.0   1st Qu.: 1676   1st Qu.:  1.000  
##  Median :   0.000   Median :  204.0   Median : 2537   Median :  4.000  
##  Mean   :   1.963   Mean   :  412.1   Mean   : 2712   Mean   :  5.214  
##  3rd Qu.:   1.000   3rd Qu.:  477.0   3rd Qu.: 3490   3rd Qu.:  8.000  
##  Max.   :3580.000   Max.   :30272.0   Max.   :10408   Max.   : 77.000  
##      price        
##  Min.   :   0.00  
##  1st Qu.:   6.91  
##  Median :  13.69  
##  Mean   :  31.03  
##  3rd Qu.:  25.49  
##  Max.   :6980.00
gt(sample_data %>%
  summarise(
    Count = n(),
    `Helpful Rate (%)` = round(mean(helpful_binary, na.rm = TRUE) * 100, 1),
    `Avg. Helpful Votes` = round(mean(helpful_vote, na.rm = TRUE), 2),
    `Avg. Review Length` = round(mean(review_length, na.rm = TRUE), 0),
    `Avg. Review Age (days)` = round(mean(review_age, na.rm = TRUE), 0),
    `Avg. Sentiment Score` = round(mean(sentiment_score, na.rm = TRUE), 2),
    `Avg. Price ($)` = round(mean(price, na.rm = TRUE), 2)
  ))
Count: 160000
Helpful Rate (%): 31.7
Avg. Helpful Votes: 1.96
Avg. Review Length: 412
Avg. Review Age (days): 2712
Avg. Sentiment Score: 5.21
Avg. Price ($): 31.03
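The `helpful_binary` indicator is defined in the cleaning script; a minimal sketch, assuming it flags reviews that received at least one helpful vote (an assumption, though consistent with the quartiles in the summary above):

```r
# Hypothetical reconstruction of helpful_binary (assumed: at least one vote)
votes <- c(0, 0, 1, 3, 0)                # toy helpful_vote values
helpful_binary <- as.integer(votes > 0)  # 1 if the review got any helpful vote
round(mean(helpful_binary) * 100, 1)     # helpful rate in percent -> 40
```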

3.2. Summary by Product Category

To detect differences across product types, the dataset is grouped by category. For each category, key indicators such as average helpful votes and sentiment scores are calculated.

category_summary <- sample_data %>%
  group_by(category) %>%
  summarise(
    Count = n(),
    `Helpful Rate (%)` = round(mean(helpful_binary) * 100, 1),
    `Avg. Sentiment` = round(mean(sentiment_score, na.rm = TRUE), 2),
    `Avg. Review Length` = round(mean(review_length, na.rm = TRUE), 0),
    `Avg. Price ($)` = round(mean(price, na.rm = TRUE), 2)
  )

gt(category_summary)
category Count Helpful Rate (%) Avg. Sentiment Avg. Review Length Avg. Price ($)
Appliances 20000 32.0 3.55 393 51.58
Beauty 20000 27.9 5.07 290 28.89
Beauty_Personal_Care 20000 29.8 6.07 377 28.30
Books 20000 37.6 7.06 635 16.34
Electronics 20000 29.6 5.45 498 49.21
Fashion 20000 20.4 6.09 271 30.87
Pet_Supplies 20000 28.1 4.98 387 27.66
Software 20000 48.5 3.43 447 15.40

3.3. Most Helpful Reviews

This section presents the top 10 reviews with the highest number of helpful votes to illustrate examples of reviews deemed valuable by other users.

sample_data %>%
  arrange(desc(helpful_vote)) %>%
  dplyr::select(helpful_vote, verified_purchase, rating, sentiment_score, text) %>%
  head(10)
## # A tibble: 10 × 5
##    helpful_vote verified_purchase rating sentiment_score text                   
##           <dbl> <lgl>              <dbl>           <int> <chr>                  
##  1         3580 FALSE                  1             -26 "For those of you who …
##  2         2868 FALSE                  1               0 "I have never been a M…
##  3         1295 TRUE                   5              11 "I adore this adult co…
##  4         1073 FALSE                  5               8 "I have spent more tim…
##  5          993 TRUE                   5              10 "[[VIDEOID:c396a5ff674…
##  6          951 FALSE                  5              52 "I'll admit, I've been…
##  7          914 FALSE                  4              13 "Without a doubt, Rose…
##  8          899 FALSE                  4              38 "TAKEAWAY TO GET THE M…
##  9          883 FALSE                  5              41 "I bought this book af…
## 10          813 FALSE                  5              -5 "This book does a very…

3.4. Reviews with the Highest Joy Scores

To explore the emotional landscape of the reviews, those with the highest “joy” scores are highlighted. This helps identify highly positive review examples.

sample_data %>%
  arrange(desc(joy)) %>%
  dplyr::select(joy, helpful_vote, rating, text) %>%
  head(5)
## # A tibble: 5 × 4
##     joy helpful_vote rating text                                                
##   <int>        <dbl>  <dbl> <chr>                                               
## 1     1            2      1 I like the idea of the spray deodorant but it had a…
## 2     1            3      5 I recently started doing my own nails and these are…
## 3     1            3      4 I bought this kit for my daughter. She has used it …
## 4     1            1      5 bought this for my 13yr old and she loves it, the t…
## 5     1            1      1 Stays on for about 4 hours. And if you get the mult…

3.5. Review Length Group Analysis

To investigate how the length of a review relates to its helpfulness and sentiment, reviews are grouped into three categories based on their character count: Short (<100), Medium (100–499), and Long (≥500). For each group, the average number of helpful votes, average sentiment score, and total count are calculated.

sample_data %>%
  mutate(length_group = case_when(
    review_length < 100 ~ "Short",
    review_length < 500 ~ "Medium",
    TRUE ~ "Long"
  )) %>%
  group_by(length_group) %>%
  summarise(avg_helpful = mean(helpful_vote),
            avg_sentiment = mean(sentiment_score),
            count = n())
## # A tibble: 3 × 4
##   length_group avg_helpful avg_sentiment count
##   <chr>              <dbl>         <dbl> <int>
## 1 Long               4.88           9.49 37968
## 2 Medium             1.29           4.62 75671
## 3 Short              0.665          2.68 46361

3.6. Extreme Sentiment Reviews

Reviews with extreme sentiment scores are isolated to better understand the characteristics of emotionally intense feedback.

sample_data %>%
  filter(sentiment_score > 8 | sentiment_score < -8) %>%
  dplyr::select(sentiment_score, helpful_vote, rating, text) %>%
  head(10)
## # A tibble: 10 × 4
##    sentiment_score helpful_vote rating text                                     
##              <int>        <dbl>  <dbl> <chr>                                    
##  1              11            0      5 These are amazing!!! I got a larger set …
##  2              12            0      5 Oh my these are fabulous and make your e…
##  3               9            7      4 I really like this marker.  It has a fin…
##  4               9            1      1 Not a good quality, more watery, doesn't…
##  5              12            1      5 I always love the ordinaryAnd so far I’v…
##  6              12            7      5 I wandered onto Amazon looking for a set…
##  7              10            0      5 Absolutely beautiful craftsmanship, easy…
##  8              17            1      5 I really like the concept and stencils, …
##  9               9            4      5 I purchased the Pantene Natural Hair Co-…
## 10              11            0      5 My client loved it…. Passion twists, but…

3.7. Normalized Sentiment

To adjust for the impact of review length on sentiment magnitude, a normalized sentiment score is computed by scaling sentiment per 100 characters.

sample_data <- sample_data %>%
  mutate(sentiment_per_100 = (sentiment_score / review_length) * 100)
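As a minimal illustration (toy values, not drawn from the dataset), the same raw score contributes less once scaled by length:

```r
# Two hypothetical reviews with identical raw sentiment but different lengths
toy <- data.frame(sentiment_score = c(5, 5), review_length = c(100, 500))
toy$sentiment_per_100 <- (toy$sentiment_score / toy$review_length) * 100
toy$sentiment_per_100  # -> 5 1 : the longer review's score is diluted
```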

4. Visualizations

This histogram displays the distribution of product reviews that received up to 20 helpful votes. The data reveal a right-skewed pattern, indicating that the vast majority of reviews receive very few helpful votes. This suggests that many reviews remain largely unseen or engage few users. The concentration of helpful votes in a limited number of reviews highlights the uneven visibility and engagement dynamics within the platform.

ggplot(sample_data %>% filter(helpful_vote <= 20), aes(x = helpful_vote)) +
  geom_histogram(binwidth = 1, fill = "#00BFC4", color = "white") +
  labs(
    title = "Distribution of Helpful Votes",
    x = "Helpful Votes",
    y = "Count"
  ) +
  theme_minimal()

This bar chart illustrates the proportion of reviews marked as helpful across different product categories. Categories are ordered by helpfulness rate, from highest to lowest. The plot highlights that certain product categories receive a consistently higher share of helpful reviews, suggesting that the perceived usefulness of reviews may vary depending on the type of product. This could reflect differences in consumer engagement, product complexity, or the relevance of review content across categories.

sample_data %>%
  group_by(category) %>%
  summarise(helpful_rate = mean(helpful_binary)) %>%
  ggplot(aes(x = reorder(category, -helpful_rate), y = helpful_rate)) +
  geom_col(fill = "#00BFC4") +
  labs(title = "Helpful Review Rate by Product Category",
       x = "Category", y = "Helpful Rate") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This histogram shows the distribution of review lengths, measured by the number of characters. Most reviews fall within the shorter to medium length range, with a gradual decline in frequency as the length increases. The x-axis is limited to 1,500 characters to focus on the majority of reviews and reduce the influence of extreme outliers. This visualization helps identify common patterns in review verbosity and supports further analysis on how review length may relate to perceived helpfulness.

ggplot(sample_data, aes(x = review_length)) +
  geom_histogram(binwidth = 50, fill = "#00BFC4", color = "white") +
  scale_x_continuous(limits = c(0, 1500)) +
  labs(
    title = "Distribution of Review Lengths",
    x = "Number of Characters", y = "Count of Reviews"
  ) +
  theme_minimal()
## Warning: Removed 7655 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).

This histogram shows the distribution of AFINN sentiment scores across reviews. Most scores are centered around zero, indicating that reviews tend to be emotionally neutral or only mildly expressive. Extreme positive or negative sentiments are relatively uncommon.

ggplot(sample_data, aes(x = sentiment_score)) +
  geom_histogram(binwidth = 1, fill = "#00BFC4", color = "white") +
  labs(
    title = "Distribution of Sentiment Scores (AFINN)",
    x = "Sentiment Score", y = "Number of Reviews"
  ) +
  theme_minimal()
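For reference, an AFINN-style score is simply the sum of word valences from a graded lexicon. A minimal sketch with a tiny hand-made lexicon (hypothetical entries; the cleaning script presumably uses the full AFINN lexicon via tidytext):

```r
# Tiny hand-made AFINN-style lexicon (hypothetical valences, range -5..5)
lexicon <- c(amazing = 4, love = 3, bad = -3, broken = -2)

text   <- "I love this, but it arrived broken"
tokens <- tolower(unlist(strsplit(text, "[^a-zA-Z]+")))
score  <- sum(lexicon[tokens], na.rm = TRUE)  # matched: love (+3), broken (-2)
score  # -> 1
```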

This histogram presents the distribution of sentiment scores calculated using the BING lexicon, which classifies words as either positive or negative.

ggplot(sample_data, aes(x = bing_score)) +
  geom_histogram(binwidth = 1, fill = "#00BFC4", color = "white") +
  labs(
    title = "Distribution of Sentiment Scores (BING)",
    x = "Sentiment Score", y = "Number of Reviews"
  ) +
  theme_minimal()
## Warning: Removed 7787 rows containing non-finite outside the scale range
## (`stat_bin()`).

This histogram visualizes the distribution of product prices in the dataset, limited to a maximum of $300.

ggplot(sample_data, aes(x = price)) +
  geom_histogram(binwidth = 5, fill = "#00BFC4", color = "white") +
  scale_x_continuous(limits = c(0, 300)) +
  labs(
    title = "Distribution of Product Prices",
    x = "Price ($)", y = "Number of Products"
  ) +
  theme_minimal()
## Warning: Removed 1565 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).

This plot displays the density distributions of AFINN sentiment scores across different product categories. Each facet represents a separate category, allowing for comparison of sentiment patterns within and between groups. The x-axis is limited to scores between 0 and 25 to focus on the most common sentiment range.

ggplot(sample_data, aes(x = sentiment_score, fill = category)) +
  geom_density(alpha = 0.6, color = "black") +
  xlim(0, 25) +
  facet_wrap(~category, scales = "free_y") +
    scale_fill_manual(values = rep("#00BFC4", length(unique(sample_data$category)))) +
  labs(
    title = "Sentiment Distribution by Product Category (AFINN)",
    x = "Sentiment Score", y = "Density"
  ) +
  theme_minimal()+
  theme(legend.position = "none")
## Warning: Removed 20051 rows containing non-finite outside the scale range
## (`stat_density()`).

This density plot illustrates the distribution of BING sentiment scores across various product categories. Each panel (facet) corresponds to a specific category, allowing a side-by-side view of how sentiment varies. The x-axis is restricted to scores between 0 and 25 to highlight the most concentrated sentiment range.

ggplot(sample_data, aes(x = bing_score, fill = category)) +
  geom_density(alpha = 0.6, color = "black") +
  xlim(0, 25) +
  facet_wrap(~category, scales = "free_y") +
    scale_fill_manual(values = rep("#00BFC4", length(unique(sample_data$category)))) +
  labs(
    title = "Sentiment Distribution by Product Category (BING)",
    x = "Sentiment Score", y = "Density"
  ) +
  theme_minimal()+
  theme(legend.position = "none") 
## Warning: Removed 27863 rows containing non-finite outside the scale range
## (`stat_density()`).

This bar chart displays the average emotion scores across all reviews, based on the NRC emotion lexicon. Each bar represents the mean occurrence of a specific emotion (e.g., joy, trust, anger) normalized across the dataset, providing an overview of the emotional tone present in the text corpus.

overall_emotion_summary <- sample_data %>%
  summarise(across(c(joy, trust, fear, anger, sadness, disgust, surprise, anticipation),
                   \(x) mean(x, na.rm = TRUE))) %>%
  pivot_longer(cols = everything(), names_to = "emotion", values_to = "mean_value")
ggplot(overall_emotion_summary, aes(x = reorder(emotion, -mean_value), y = mean_value)) +
  geom_col(show.legend = FALSE, fill = "#00BFC4") +
  labs(title = "Average Emotion Scores (Overall)",
       x = "Emotion", y = "Mean Score") +
  theme_minimal()

This bar chart shows the average number of helpful votes received by reviews, grouped by whether the reviewer was a verified purchaser. It visually compares the perceived helpfulness between verified and non-verified reviews.

sample_data %>%
  group_by(verified_purchase) %>%
  summarise(avg_helpful = mean(helpful_vote, na.rm = TRUE)) %>%
  ggplot(aes(x = as.factor(verified_purchase), y = avg_helpful)) +
  geom_col(fill = "#00BFC4") +
  scale_x_discrete(labels = c("FALSE" = "Not Verified", "TRUE" = "Verified")) +
  labs(
    title = "Average Helpful Votes by Verified Status",
    x = "Verified Purchase",
    y = "Average Helpful Votes"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This scatter plot illustrates the relationship between review age (in days) and the number of helpful votes a review has received. Both variables are log-transformed to handle skewed distributions. A LOESS trend line is added to visualize the overall pattern in the data.

sample_data %>%
  filter(review_age > 0) %>%
  ggplot(aes(x = log10(review_age), y = log10(helpful_vote + 1))) +
  geom_point(alpha = 0.3, color = "#00BFC4") +
  geom_smooth(method = "loess", se = FALSE, color = "black", linetype = "dashed") +
  labs(
    title = "Relationship Between Review Age and Helpful Votes",
    x = "Review Age (log10 days)",
    y = "Helpful Votes (log10)",
    caption = "Reviews with 0 helpful votes included via log10(helpful_vote + 1)"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

This bar chart displays the distribution of reviews based on their verified purchase status. The x-axis categorizes reviews as either “Verified” or “Not Verified”, while the y-axis represents the total number of reviews in each group. The chart provides an overview of the relative volume of verified and unverified reviews in the dataset, which is useful for understanding the underlying balance in purchase credibility.

sample_data %>%
  count(verified_purchase) %>%
  ggplot(aes(x = as.factor(verified_purchase), y = n)) +
  geom_col(fill = "#00BFC4") +
  scale_x_discrete(labels = c("FALSE" = "Not Verified", "TRUE" = "Verified")) +
  labs(
    title = "Distribution of Verified Purchase Status",
    x = "Verified Purchase",
    y = "Number of Reviews"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

sample_data$review_year <- format(as.Date(sample_data$review_date), "%Y")

temporal_summary <- sample_data %>%
  group_by(review_year, verified_purchase) %>%
  summarise(mean_age = mean(review_age, na.rm = TRUE),
            count = n(),
            .groups = "drop")

This line chart visualizes the yearly proportion of verified and non-verified reviews. The x-axis represents the year in which the review was posted, while the y-axis shows the percentage of reviews that were either verified or not verified. By plotting these trends over time, the chart helps reveal shifts in platform authenticity, such as an increase in verified purchases, which may reflect changes in platform policy or user behavior.

temporal_verified <- sample_data %>%
  group_by(review_year) %>%
  summarise(
    verified = mean(verified_purchase == TRUE, na.rm = TRUE),
    non_verified = mean(verified_purchase == FALSE, na.rm = TRUE)
  ) %>%
  pivot_longer(cols = c("verified", "non_verified"), names_to = "type", values_to = "rate")


ggplot(temporal_verified, aes(x = as.numeric(review_year), y = rate, color = type)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2) +
  scale_color_manual(values = c("verified" = "#1F78B4", "non_verified" = "#E31A1C"),
                     labels = c("verified" = "Verified", "non_verified" = "Not Verified")) +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Verified vs. Non-Verified Reviews Over Time",
    x = "Review Year",
    y = "Review Proportion (%)",
    color = "Review Type"
  ) +
  theme_minimal()

This bar chart displays the frequency distribution of star ratings assigned to reviews. The x-axis represents the rating values (typically from 1 to 5 stars), while the y-axis shows the number of reviews for each rating. This visualization provides insight into the overall satisfaction levels of customers.

ggplot(sample_data, aes(x = rating)) +
  geom_bar(fill = "#00BFC4") +
  labs(title = "Distribution of Star Ratings", x = "Star Rating", y = "Number of Reviews")

This bar chart presents the average number of helpful votes received for reviews at each star rating level. The x-axis indicates the star rating, while the y-axis shows the corresponding average helpful vote count. This visualization helps identify how perceived helpfulness varies with customer satisfaction levels.

ggplot(sample_data, aes(x = rating, y = helpful_vote)) +
  stat_summary(fun = mean, geom = "col", fill = "#00BFC4") +
  labs(title = "Average Helpful Votes by Rating", x = "Star Rating", y = "Average Helpful Votes")

sample_data <- sample_data %>%
  mutate(review_age_group = case_when(
    review_age < 1000 ~ "Recent",
    review_age >= 1000 & review_age < 3000 ~ "Medium",
    review_age >= 3000 ~ "Old"
  ))



summary_data <- sample_data %>%
  group_by(rating, verified_purchase, review_age_group) %>%
  summarise(avg_helpful = mean(helpful_vote, na.rm = TRUE), .groups = "drop")

This plot shows how helpfulness scores vary depending on the star rating, whether the review is from a verified purchaser, and how old the review is. By breaking the data into different age groups, it helps highlight how the relationship between rating and helpfulness can shift over time and between verified and non-verified reviews.

ggplot(summary_data, aes(x = factor(rating), y = avg_helpful, color = as.factor(verified_purchase), group = verified_purchase)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2) +
  facet_wrap(~ review_age_group) +
  scale_color_manual(values = c("#E41A1C", "#377EB8"), labels = c("Not Verified", "Verified")) +
  labs(
    title = "Interaction Between Star Rating, Verified Purchase, and Review Age on Review Helpfulness",
    x = "Star Rating",
    y = "Average Helpful Votes",
    color = "Verified Purchase"
  ) +
  theme_minimal(base_size = 13)

sample_data <- sample_data %>%
  mutate(length_group = cut(review_length,
                            breaks = quantile(review_length, probs = c(0, 0.33, 0.66, 1), na.rm = TRUE),
                            labels = c("Short", "Medium", "Long"),
                            include.lowest = TRUE))

This scatter plot explores the relationship between the sentiment score (measured with the AFINN lexicon) and review length. Review lengths are log-transformed for readability, and reviews are grouped by length category (Short, Medium, Long) to observe patterns across types. A LOESS trend line highlights the general trend within each group.

sample_data %>%
  ggplot(aes(x = sentiment_score, y = log(review_length + 1), color = length_group)) +
  geom_point(alpha = 0.2, size = 0.8) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(
    title = "Relationship Between Review Length and AFINN Sentiment Score",
    x = "AFINN Sentiment Score",
    y = "Log Review Length"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

sample_data %>%
  ggplot(aes(x = bing_score, y = log(review_length + 1), color = length_group)) +
  geom_point(alpha = 0.2, size = 0.8) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(
    title = "Relationship Between Review Length and BING Sentiment Score",
    x = "BING Sentiment Score",
    y = "Log Review Length"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 7787 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 2
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 0
## Warning: Removed 7787 rows containing missing values or values outside the scale range
## (`geom_point()`).

This line plot compares the average NRC sentiment scores across different review length groups (Short, Medium, Long). Each line represents a specific emotion, helping to visualize how emotional tone varies with the length of a review.

sample_data %>%
  group_by(length_group) %>%
  summarise(across(c(anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise, trust),
                   \(x) mean(x, na.rm = TRUE))) %>%
  pivot_longer(cols = -length_group, names_to = "sentiment", values_to = "mean_score") %>%
  ggplot(aes(x = length_group, y = mean_score, color = sentiment, group = sentiment)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  labs(
    title = "Average NRC Sentiment Scores by Review Length Groups",
    x = "Review Length Groups",
    y = "Average Sentiment Score",
    color = "Sentiment"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom"
  )

This violin plot displays the distribution of helpful votes across different product categories. The y-axis is log-transformed to better visualize the wide range of vote counts, revealing how the helpfulness of reviews varies by category.

ggplot(sample_data, aes(x = category, y = helpful_vote)) +
  geom_violin(fill = "#00BFC4") +
  scale_y_log10() +
  labs(
    title = "Helpful Votes Distribution by Category",
    x = "Category",
    y = "Helpful Votes (Log)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning in scale_y_log10(): log-10 transformation introduced infinite values.
## Warning: Removed 109214 rows containing non-finite outside the scale range
## (`stat_ydensity()`).

This combined histogram and density plot illustrates the distribution of review age in days, showing how recent or old the reviews are within the dataset. The histogram captures the frequency of reviews by age, while the overlaid density curve highlights the overall trend in their temporal distribution.

ggplot(sample_data, aes(x = review_age)) +
  geom_histogram(bins = 50, fill = "#00BFC4", color = "white", alpha = 0.8) +
  geom_density(aes(y = after_stat(count)), color = "black", linewidth = 1, adjust = 1.5) +
  labs(
    title = "Distribution of Review Age (days)",
    x = "Review Age (days)",
    y = "Number of Reviews"
  ) +
  theme_minimal()

This summary table shows how review characteristics differ across product categories and verified purchase status. It includes the number of reviews, average helpful votes, review length, star rating, and AFINN sentiment score for each group, offering a quick overview of key patterns in the data.

summary_table <- sample_data %>%
  group_by(category, verified_purchase) %>%
  summarise(
    N = n(),
    Avg_Helpful = mean(helpful_vote, na.rm = TRUE),
    Avg_Length = mean(review_length, na.rm = TRUE),
    Avg_Rating = mean(rating, na.rm = TRUE),
    Avg_Afinn = mean(sentiment_score, na.rm = TRUE),
    .groups = "drop"
  )

This code builds a summary table that adds Bing sentiment proportions to the previous category-level overview. For each product category and verification status, it calculates the percentage of reviews labeled as positive, neutral, or negative based on the Bing lexicon. These percentages are then merged with the earlier summary (which included helpfulness, length, rating, and AFINN score), resulting in a more complete view of how emotional tone and review quality vary across groups. Finally, the table is formatted using flextable for clearer presentation.

bing_pct_table <- sample_data %>%
  filter(!is.na(bing_label)) %>%
  group_by(category, verified_purchase, bing_label) %>%
  summarise(N = n(), .groups = "drop") %>%
  group_by(category, verified_purchase) %>%
  mutate(Share = N / sum(N)) %>%
  dplyr::select(-N) %>%
  pivot_wider(names_from = bing_label, values_from = Share, values_fill = 0) %>%
  mutate(
    Percent_Negative = round(Negative * 100, 1),
    Percent_Positive = round(Positive * 100, 1),
    Percent_Neutral = round(Neutral * 100, 1)
  ) %>%
  dplyr::select(category, verified_purchase, Percent_Negative, Percent_Neutral, Percent_Positive)

final_summary <- left_join(summary_table, bing_pct_table, 
                           by = c("category", "verified_purchase"))



colnames(final_summary)[colnames(final_summary) == "Percent_Negative"] <- "BING_Negative (%)"
colnames(final_summary)[colnames(final_summary) == "Percent_Positive"] <- "BING_Positive (%)"
colnames(final_summary)[colnames(final_summary) == "Percent_Neutral"]  <- "BING_Neutral (%)"



b <- final_summary %>%
  flextable() %>%
  set_header_labels(
    category = "Category",
    verified_purchase = "Verified",
    Avg_Helpful = "Avg. Helpful",
    Avg_Length = "Avg. Length",
    Avg_Rating = "Avg. Rating",
    Avg_Afinn = "AFINN (Mean)",
    `BING_Negative (%)` = "BING Negative (%)",
    `BING_Neutral (%)` = "BING Neutral (%)",
    `BING_Positive (%)` = "BING Positive (%)"
  ) %>%
  fontsize(part = "header", size = 11) %>%
  fontsize(part = "body", size = 10) %>%
  font(part = "all", fontname = "Times New Roman") %>%
  autofit()

b